Module 01: Introduction to R and Data Wrangling
School of Mathematical and Physical Sciences
Introduction to R and RStudio
R is a free language and environment for statistical computing and graphics.
R is modular: most functionality comes from add-on packages, so the language can be thought of as a platform for creating and running a large number of useful packages.
To set up R & RStudio on your computer, install R first, then install the RStudio Desktop (Open Source License) version for your operating system.
While R is running, variables, data, functions, results, etc., are stored in memory on the computer in the form of objects that have a name.
These objects are acted on with operators (arithmetic, logical, comparison, etc.) and functions (which are themselves objects).
Because R stores results in an object (a data structure), an analysis can be done with no result displayed. This is very useful, since a user can extract only the part of the results that is of interest and can pass results into further analyses.
Artwork by @allison_horst
R is like a car’s engine while RStudio is like a car’s dashboard. Much as we don’t drive a car by interacting directly with the engine but rather with elements on the car’s dashboard, we won’t be using R directly; we will use RStudio’s interface instead.
RStudio is a free, open source, and R-specific integrated development environment (IDE).
The console in RStudio is the same as the one you would get if you opened the R application or typed “R” in your command line environment.
R reads and interprets what you’ve written and tries to execute it;
R prints a result to the console.
In RStudio go to File > New Project.
This creates an .Rproj file for your project and will automatically change your working directory to the workshop materials directory.
Artwork by @allison_horst
R and its packages
Many R functions come in packages, free libraries of code written by R’s active user community.
To install an R package, open an R session and type at the Console: install.packages("package name")
R will download the package from CRAN, so you’ll need to be connected to the internet.
Once a package is installed, load it in your R session by running library(package name)
For example, to install the pak package, you would run: install.packages("pak")
Why pak?
pak provides a fresh approach to R package installation.
pak installs R packages from CRAN, Bioconductor, GitHub, URLs, git repositories, local files and directories with a single function.
From here on, the suggested way to install a package is pak::pak("package name")
instead of install.packages("package name"),
unless pak::pak() gives you an error.
R Objects – Vectors
Tip
What you use is up to you, but be consistent, and remember that you’re likely going to be typing these out quite a few times, so try to be concise too.
An atomic vector is just a simple vector of data. You can make an atomic vector by combining multiple elements with c():
[1] 1 2 3 4 5 6
[1] TRUE
[1] "double"
[1] 6
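The commands behind the output above are not shown here. A minimal sketch that produces the same four results, assuming the vector is stored under the hypothetical name `die`:

```r
# Combine six numbers into one atomic vector (the name `die` is an assumption)
die <- c(1, 2, 3, 4, 5, 6)
die             # print the vector
is.vector(die)  # TRUE: it is a vector
typeof(die)     # "double": numbers are stored as doubles by default
length(die)     # 6 elements
```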
Note
Objects are created with the assignment operator <-.
Assignment statements take the form object_name <- value
Logical vectors are the simplest type of atomic vector because they can take only three possible values: FALSE, TRUE, and NA. Logical vectors are usually constructed with comparison operators.
Comparison operators include: >, >=, <, <=, != (not equal), and == (equal).
[1] 1 2 3 4 5 6 7 8 9 10
[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
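As a sketch of how a comparison builds a logical vector, the following would give output like that shown above. The name `x` and the even-number test are assumptions, since the original commands are not shown:

```r
x <- 1:10   # integers 1 to 10
x
# %% is the remainder operator; comparing the remainder with 0 flags even numbers
x %% 2 == 0
```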
Warning
Take care not to use = instead of == when testing for equality.
= is an assignment operator (but <- is the preferred assignment operator).
[1] "Hello" "World"
[1] "character"
[1] "1" "2" "three"
[1] "character"
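Output of this kind arises from character vectors and from mixing types inside c(). A small sketch, with `text` and `mixed` as hypothetical names:

```r
text <- c("Hello", "World")
text
typeof(text)    # "character"

# An atomic vector holds one type only, so the numbers are coerced to strings
mixed <- c(1, 2, "three")
mixed
typeof(mixed)   # "character"
```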
Note
Logicals, numerics, and characters are the most common types of atomic vectors in R, but R also recognises two more types: complex and raw.
tidyverse
To check whether your tidyverse packages are up to date, run tidyverse_update().
tidyverse
The tidyverse is a set of packages that work in harmony because they share an underlying design philosophy, grammar, and data representations.
The core tidyverse includes the packages that you’re likely to use in everyday data analyses.
As of tidyverse 2.0.0, the following packages are included in the core tidyverse.
tidyverse (Cont.)
library(tidyverse) will load the core tidyverse packages:
ggplot2 for data visualisation
dplyr for data manipulation
tidyr for data tidying
readr for data import
purrr for functional programming
tibble for tibbles, a modern re-imagining of data frames
stringr for strings
forcats for factors
lubridate for dates and times
tidyverse to Phases of the Data Science Cycle
Tidy Data and Data Wrangling
R (and within tidyverse of @wickhamTidyverseEasilyInstall2022).
Artwork by @allison_horst
tibble VS data.frame
A tibble is often considered a neater format of a data frame, and it is often used in the tidyverse packages.
It contains the same information as a data frame, but the manipulation and representation of tibbles is different from data frames in some aspects.
Compared to data.frame(), tibble() does much less (and probably complains much more!)
Artwork by @allison_horst
tibble VS data.frame (Cont.)
A data frame can be converted to a tibble with as_tibble()
[1] "data.frame"
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
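A sketch of the conversion that yields output like the above, assuming the tibble package is installed; `iris_tbl` is a hypothetical name:

```r
library(tibble)

class(iris)                  # the built-in iris data is a plain data.frame
iris_tbl <- as_tibble(iris)  # convert it to a tibble
iris_tbl                     # prints only the first 10 rows, with column types
class(iris_tbl)
```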
The term data wrangling is used interchangeably with data manipulation.
Artwork by @allison_horst
# A tibble: 50 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 40 more rows
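A 50-row result like the one above can be obtained with base R subsetting. The sketch below assumes the built-in iris data and uses hypothetical object names:

```r
# Rows: a logical vector picks the setosa flowers; the empty column index keeps all columns
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)

# Columns: the empty row index keeps all rows; a character vector picks columns
sepal_cols <- iris[, c("Sepal.Length", "Sepal.Width")]
head(sepal_cols, 3)

# Rows and columns together
setosa_sepal <- iris[iris$Species == "setosa", c("Sepal.Length", "Sepal.Width")]
dim(setosa_sepal)
```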
Note
There are quite a few different ways to use [ with a data frame, but the most important way is to select rows and columns independently with df[rows, cols].
For example, df[rows, ] and df[, cols] select just rows or just columns, using the empty subset to preserve the other dimension.
dplyr::filter()
Artwork by @allison_horst
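A minimal sketch of filter() on the built-in iris data, assuming dplyr is installed; the object names and the conditions are illustrative only. The base R equivalent is included for comparison:

```r
library(dplyr)

# Keep only rows where both conditions hold
big_setosa <- filter(iris, Species == "setosa", Sepal.Length > 5)
nrow(big_setosa)

# Base R version: which() drops rows where the condition is NA,
# which filter() does automatically
base_version <- iris[which(iris$Species == "setosa" & iris$Sepal.Length > 5), ]
nrow(base_version) == nrow(big_setosa)
```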
dplyr::filter() (Cont.)
Note
filter() is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values.
Many tidyverse packages import the magrittr package to use %>%.
x %>% f(y) turns into f(x, y), so the result from one step is then “piped” into the next step. You can use the pipe to rewrite multiple operations so that they read left-to-right, top-to-bottom.
When you see %>%, read it as “and then”.
dplyr::slice()
# A tibble: 3 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
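The three-row result above can be reproduced with slice(). This sketch assumes dplyr and tibble are installed and writes the call both as a nested call and with the %>% pipe; `first_three` is a hypothetical name:

```r
library(dplyr)
library(tibble)

# Nested call: read from the inside out
slice(as_tibble(iris), 1:3)

# Piped version: read left to right as "take iris, and then ..., and then ..."
first_three <- iris %>%
  as_tibble() %>%
  slice(1:3)
first_three
```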
Tip
slice_head() and slice_tail() select the first or last rows
slice_sample() randomly selects rows
slice_min() and slice_max() select rows with the lowest or highest values of a variable
# A tibble: 50 × 2
Sepal.Length Sepal.Width
<dbl> <dbl>
1 5.1 3.5
2 4.9 3
3 4.7 3.2
4 4.6 3.1
5 5 3.6
6 5.4 3.9
7 4.6 3.4
8 5 3.4
9 4.4 2.9
10 4.9 3.1
# ℹ 40 more rows
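A result with this shape (50 rows, two sepal columns) can be produced by combining filter() with select(). The sketch below assumes dplyr and tibble are installed; choosing the setosa rows and the starts_with() helper is an assumption about how the output above was made:

```r
library(dplyr)
library(tibble)

setosa_sepal <- iris %>%
  as_tibble() %>%
  filter(Species == "setosa") %>%
  select(starts_with("Sepal"))   # keeps Sepal.Length and Sepal.Width
setosa_sepal

# The same columns selected by name, and by excluding the others
iris %>% select(Sepal.Length, Sepal.Width) %>% head(3)
iris %>% select(!c(Petal.Length, Petal.Width, Species)) %>% head(3)
```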
dplyr::select()
dplyr: A Grammar of Data Manipulation
dplyr aims to provide a function for each basic verb of data manipulation.
filter() chooses rows based on column values.
slice() chooses rows based on location.
arrange() changes the order of the rows.
select() changes whether or not a column is included.
rename() changes the name of columns.
mutate() changes the values of columns and creates new columns.
relocate() changes the order of the columns.
summarise() collapses a group into a single row.
mtcars dataset.
Tip
Get some help by typing ?dplyr::arrange() in Console
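As a sketch of how several of these verbs chain together, the following uses the built-in mtcars data and assumes dplyr is installed. The derived column `kpl` and the object names are illustrative only:

```r
library(dplyr)

four_cyl <- mtcars %>%
  filter(cyl == 4) %>%         # rows: four-cylinder cars only
  arrange(desc(mpg)) %>%       # order: most fuel-efficient first
  select(mpg, cyl, hp) %>%     # columns: keep three
  mutate(kpl = mpg * 0.425)    # new column: approximate km per litre
four_cyl

# summarise() collapses each group to a single row
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))
```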
dplyr::mutate()
Artwork by @allison_horst
dplyr::mutate() (Cont.)
# A tibble: 6 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal_ratio
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.46
2 4.9 3 1.4 0.2 setosa 1.63
3 4.7 3.2 1.3 0.2 setosa 1.47
4 4.6 3.1 1.5 0.2 setosa 1.48
5 5 3.6 1.4 0.2 setosa 1.39
6 5.4 3.9 1.7 0.4 setosa 1.38
Tip
case_when() and if_else() are useful for conditional mutation.
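The sepal_ratio column above is consistent with dividing Sepal.Length by Sepal.Width (5.1 / 3.5 ≈ 1.46). A sketch under that assumption, with dplyr and tibble installed; the `shape` and `size` columns and their cut-offs are hypothetical illustrations of if_else() and case_when():

```r
library(dplyr)
library(tibble)

iris_ratio <- iris %>%
  as_tibble() %>%
  mutate(
    sepal_ratio = Sepal.Length / Sepal.Width,
    # if_else(): a two-way choice
    shape = if_else(sepal_ratio > 1.5, "long", "wide"),
    # case_when(): several conditions checked in order
    size = case_when(
      Sepal.Length < 5 ~ "small",
      Sepal.Length < 6 ~ "medium",
      TRUE ~ "large"
    )
  )
head(iris_ratio)
```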
tidyselect: Selection language
A backend for the selecting functions of the tidyverse. It provides helpers for selecting variables.
: for selecting contiguous variables.
! for taking the complement of a set of variables.
& or | for selecting the intersection or union of two sets of variables.
starts_with() selects columns with the given prefix.
ends_with() selects columns with the given suffix.
everything() to select all variables.
last_col() to select the last variable, with the option of an offset.
contains() selects columns that contain a literal string.
all_of() for selecting columns based on a character vector.
We will use a data set named babynames, which comes in a package that is also named babynames.
In babynames, you will find information about almost every name given to children in the United States since 1880.
# A tibble: 1,924,665 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
5 1880 F Minnie 1746 0.0179
6 1880 F Margaret 1578 0.0162
7 1880 F Ida 1472 0.0151
8 1880 F Alice 1414 0.0145
9 1880 F Bertha 1320 0.0135
10 1880 F Sarah 1288 0.0132
# ℹ 1,924,655 more rows
filter out
rows where prop is greater than or equal to 0.08;
rows where the name is Khaleesi;
rows where the name is Sea;
Hint: ? '%in%'
Join Data Sets
The flights dataset in the nycflights13 package provides some relevant information.
# pak::pak("nycflights13") # only needed if you have never installed this package
library(nycflights13)
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 16 × 2
carrier avg_delay
<chr> <dbl>
1 F9 21.9
2 FL 20.1
3 EV 15.8
4 YV 15.6
5 OO 11.9
6 MQ 10.8
7 WN 9.65
8 B6 9.46
9 9E 7.38
10 UA 3.56
11 US 2.13
12 VX 1.76
13 DL 1.64
14 AA 0.364
15 HA -6.92
16 AS -9.93
F9 had the worst record for delays in the New York City area in 2013.
But which airline is F9?
The nycflights13 package comes with another data set, airlines, which matches the name of each airline to its carrier code.
# A tibble: 16 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
You could look up codes such as F9 manually every time, but you probably don’t want to do it that way.
Instead, use one of dplyr’s four join functions: left_join(), right_join(), full_join(), and inner_join().
To see how they work, we will use two small data sets, band_members and band_instruments, which look like this (the datasets are named band & instruments respectively in the images below):
Note
Both data sets share a column called name.
To see the raw data, you can run the following code (both data sets are part of dplyr)
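Both data sets ship with dplyr, so, assuming dplyr is installed, printing them is enough to see the raw data:

```r
library(dplyr)

band_members      # columns: name, band
band_instruments  # columns: name, plays
```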
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
left_join()
The left_join() function returns a copy of a data set that is augmented with information from a second data set.
Important
left_join() (cont.)
In R, we do:
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
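The result above can be reproduced as sketched below, assuming dplyr 1.1.0 or later is installed. The other three joins named earlier are included for comparison, along with join_by() for a key whose name differs between tables. The last call uses band_instruments2, which also ships with dplyr and stores the name in a column called artist.

```r
library(dplyr)

# left_join(): every row of band_members, plus matching info from band_instruments
band_members %>% left_join(band_instruments, by = "name")

# right_join(): every row of band_instruments instead
band_members %>% right_join(band_instruments, by = "name")

# full_join(): every row of both tables, padded with NA
band_members %>% full_join(band_instruments, by = "name")

# inner_join(): only names present in both tables
band_members %>% inner_join(band_instruments, by = "name")

# Different key names: match `name` on the left to `artist` on the right
band_members %>% left_join(band_instruments2, by = join_by(name == artist))
```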
The by argument specifies the column that the two datasets have in common.
What if the column names are different in the two datasets?
left_join() (cont.)
Use the join_by() function for the by argument
The by argument can be omitted for succinctness
right_join()
right_join() does the opposite of left_join();
Important
right_join() (cont.)
In R, we do:
full_join()
full_join() retains every row from each data set, inserting NA placeholders throughout the results as necessary.
Important
full_join() (cont.)
In R, we do:
inner_join()
inner_join() only retains rows that appear in both datasets.
Important
inner_join() (cont.)
In R, we do:
Join the results with airlines in a way that keeps every row of the results, but only the matching rows of airlines.
Then select the name and avg_delay columns in that order.
# A tibble: 16 × 2
name avg_delay
<chr> <dbl>
1 Frontier Airlines Inc. 21.9
2 AirTran Airways Corporation 20.1
3 ExpressJet Airlines Inc. 15.8
4 Mesa Airlines Inc. 15.6
5 SkyWest Airlines Inc. 11.9
6 Envoy Air 10.8
7 Southwest Airlines Co. 9.65
8 JetBlue Airways 9.46
9 Endeavor Air Inc. 7.38
10 United Air Lines Inc. 3.56
11 US Airways Inc. 2.13
12 Virgin America 1.76
13 Delta Air Lines Inc. 1.64
14 American Airlines Inc. 0.364
15 Hawaiian Airlines Inc. -6.92
16 Alaska Airlines Inc. -9.93
Manipulating date and time
Dealing with dates alone is relatively straightforward compared to dealing with date and time.
Dealing with date and time simultaneously is trickier.
Let’s start with just dates first
Date 📅 even though it looks like character 🔢
Date objects
Use as.Date to convert objects to Date 📅
Format codes are listed in ?strptime, but some depend on your operating system
%b abbreviated month
%B full month
%d day of the month (01, 02, …, 31)
%e day of the month (1, 2, …, 31)
%y year without century (00-99)
%Y year with century, e.g. 1999
POSIXct and POSIXlt - try to avoid using POSIXlt if possible
POSIX stands for Portable Operating System Interface
ct stands for calendar time
[1] "2020-12-02 13:00:00 AEDT"
[1] 1606874400
attr(,"tzone")
[1] ""
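A short sketch of both ideas using base R only. The input strings are made up for illustration, and the time zone is set explicitly, which is an assumption: the output above appears to rely on the system time zone, so its tzone attribute is empty.

```r
# Character input is converted to Date by describing its layout with a format
d <- as.Date("02/12/2020", format = "%d/%m/%Y")
class(d)                # "Date"
format(d, "%Y-%m-%d")   # back to text in a chosen layout

# POSIXct stores a date-time as seconds since 1970-01-01 00:00:00 UTC
dt <- as.POSIXct("2020-12-02 13:00", tz = "Australia/Sydney")
dt
unclass(dt)             # the underlying number of seconds, plus a tzone attribute
```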
POSIXlt seems like it’s the same as POSIXct
syd <- as.POSIXct("2023-04-20 13:00",
  format = "%Y-%m-%d %H:%M",
  tz = "Australia/Sydney"
)
perth <- as.POSIXct("2023-04-20 13:00",
  format = "%Y-%m-%d %H:%M",
  tz = "Australia/Perth"
)
# 13:00 in Sydney occurs two hours before 13:00 in Perth
syd - perth
Time difference of -2 hours
Valid time zone names are listed by OlsonNames().
Working with lubridate
lubridate
Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.
Artwork by @allison_horst
lubridate
To convert to a Date, you can use ymd and friends. E.g.
[1] "2012-12-30"
[1] "1999-01-30"
[1] "2015-01-01"
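A sketch of ymd() and its relatives, assuming lubridate is installed. The input strings are invented to match the three dates printed above; the original inputs are not shown:

```r
library(lubridate)

# The function name gives the order of year, month and day in the input
ymd("2012-12-30")
mdy("01/30/1999")
dmy("1-1-2015")
```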
ymd and friends
y = year,
m = month, and
d = day.
lubridate
To convert to POSIXct, you can use ymd_hms and friends
[1] "2014-01-01 20:10:01 AEDT"
[1] "2010-09-09 16:00:00 UTC"
[1] "2009-09-09 16:30:00 UTC"
[1] "2019-07-09 04:30:03 UTC"
[1] NA
ymd_hms and friends
h = hour, m = minute, and s = second.
lubridate
To create a Date from individual date components:
To create a POSIXct from individual components:
lubridate
[1] Oct
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
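A sketch of building dates from components and taking them apart again, assuming lubridate is installed. make_date() and make_datetime() are the lubridate constructors assumed here, and the values are made up:

```r
library(lubridate)

d  <- make_date(year = 2023, month = 10, day = 5)
dt <- make_datetime(2023, 10, 5, 14, 30, 0, tz = "Australia/Sydney")
d
dt

# Accessors pull out individual components
year(d)
month(d)                # the month as a number
month(d, label = TRUE)  # an ordered factor; the label text depends on locale
hour(dt)
```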
lubridate
Duration is a special class in lubridate which represents an exact number of seconds under the hood.
Constructors for Duration are:
lubridate
[1] "950400s (~1.57 weeks)"
[1] "2013-01-06"
[1] "2024-04-07 02:30:00 AEDT"
lubridate
Period is another special class in lubridate which represents human units like weeks and months, which have no fixed length in seconds.
Constructors for Period are like those for Duration but without the prefix “d”:
lubridate (cont.)
[1] "11d 0H 0M 0S"
[1] "2013-01-06"
[1] "2023-04-04 02:00:00 AEST"
[1] "2023-04-02 02:00:00 AEDT"
[1] "2023-04-02 03:00:00 AEST"
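The difference between the two classes shows up across a daylight saving change. A sketch assuming lubridate is installed, using the end of daylight saving in Sydney on 2 April 2023; the starting time is chosen for illustration:

```r
library(lubridate)

ddays(11)   # Duration: an exact number of seconds
days(11)    # Period: 11 calendar days

start <- ymd_hms("2023-04-01 12:00:00", tz = "Australia/Sydney")

# Duration adds exactly 24 * 60 * 60 seconds, so the clock time shifts
start + ddays(1)
# Period moves to the same clock time on the next day, 25 hours later here
start + days(1)
```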
durations and periods?
    NewYears       MLKing   GWBirthday     Memorial   Juneteenth Independence
    20240101     20240115     20240219     20240527     20240619     20240704
       Labor     Columbus     Veterans Thanksgiving    Christmas
    20240902     20241014     20241111     20241128     20241225
# A tibble: 7 × 2
holiday date
<chr> <date>
1 New Year's Day 2024-01-01
2 Australia Day 2024-01-26
3 Good Friday 2024-03-29
4 Easter Monday 2024-04-01
5 ANZAC Day 2024-04-25
6 Christmas Day 2024-12-25
7 Boxing Day 2024-12-26
[1] 1 1 1 2 2 3 3 4 4 4 4
[1] "winter" "winter" "winter" "spring" "summer" "summer" "autumn" "autumn"
[9] "autumn" "autumn" "winter"
The oz_climate data (oz_climate.csv) contains results from a survey about attitudes towards climate change in Australia.
# A tibble: 1,927 × 200
RespondentID StartDate EndDate q1_1 q1_2 q1_3 q1_4 q1_5 q1_6 q1_7 q2_1
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1502636269 08/04/20… 08/04/… Stro… Stro… Unsu… Mild… Unsu… Mild… Unsu… Unsu…
2 1502666184 08/04/20… 08/04/… Mild… Unsu… Mild… Mild… Mild… Mild… Stro… Unsu…
3 1502686727 08/04/20… 08/04/… Stro… Mild… Stro… Mild… Mild… Stro… Stro… Mild…
4 1502731096 08/04/20… 08/04/… Stro… Stro… Mild… Stro… Stro… Mild… Mild… Mild…
5 1502742259 08/04/20… 08/04/… Stro… Mild… Mild… Unsu… Stro… Mild… Stro… Stro…
# ℹ 1,922 more rows
# ℹ 189 more variables: q2_2 <chr>, q2_3 <chr>, q2_4 <chr>, q2_5 <chr>,
# q2_6 <chr>, q2_7 <chr>, q2_8 <chr>, q3_1 <chr>, q3_2 <chr>, q3_3 <chr>,
# q3_4 <chr>, q3_5 <chr>, q3_6 <chr>, q3_7 <chr>, q4 <chr>, q5 <chr>,
# q6 <chr>, q7 <chr>, q7_other <chr>, q8 <chr>, q9 <chr>, q10 <chr>,
# q11 <chr>, q12 <chr>, q13 <chr>, q14_1 <chr>, q14_2 <chr>, q15 <chr>,
# q16_1 <chr>, q16_2 <chr>, q16_3 <chr>, q16_4 <chr>, q16_5 <chr>, …
The oz_climate_qbook data (oz_climate_qbook.csv) contains the translation of each column label in oz_climate to the actual question asked.
# A tibble: 200 × 2
code desc
<chr> <chr>
1 RespondentID RespondentID
2 StartDate StartDate
3 EndDate EndDate
4 q1_1 We are approaching the limit of the number of people the earth c…
5 q1_2 Humans have the right to modify the natural environment to suit …
# ℹ 195 more rows
Compute the five number summary for the time taken to complete the survey for oz_climate by filling in … below.
Filter oz_climate to surveys that were completed on or after August 13th 2011.
Convert each string below to appropriate date, time or datetime objects.
R and RStudio
R, particularly data.frame and tibble
dplyr